Project update

Section 1: Project Overview

Provide an overview of your project including the research questions. You may start with the text from your proposal but it must be edited for clarity and any updates to your thinking.

Formula 1 is a racing sport and the pinnacle of mechanical engineering where 10 teams (constructors) race each other for the world championship (mostly glory). We are considering data from 1950, the time that Formula 1 started, to 2017. As in any typical sport, the world championships of Formula 1 racing are decided by a number of parameters such as qualifying race times (these races usually happen before the main race to determine the starting positions of the cars in the main race) , lap times ( the time for a driver to go around a track once) and most importantly pit stop timings (when the tires of the cars are changed, the car is refueled, any broken parts are replaced) to name a few.

This is interesting because the person who is in front of the grid (a grid is the positioning of the cars from the start line of the race) may not necessarily win every time. Teams can follow a variety of different strategies, i.e., perform an undercut, overcut or push when required, in order to gain an advantage in the later stages of a specific race. This variation of data can be analyzed and visualized in numerous ways. Formula 1 being the bleeding edge of mechanical engineering provides a framework for the cars that are going to be designed in the future (usually the automobile technology that is being used in these cars, take a lot of time to be incorporated into normal cars) and therefore this analysis is significant.

Questions of Interest
● How many drivers who did not start the race in the first position went on to win?
● How many times has a constructor(teams) won at that specific venue over the years?
● How has the pitstop times changed over the years?
● Who is the driver who has won the most number of formula 1 championships in history?

Expected findings
● Our analysis indicates that over a period of time, there is a significant reduction in the pitstop time. ● The driver who has the maximum number of race wins, points and championships is Michael Schumacher.

As is, there has been no change to our research questions. Going forward, we intend to analyze car design changes over the years and how they have impacted results.

Section 2: Milestones and Progress

In bullet points or a table, list the project milestones from your proposal for dates up to November 4th. Have you met these milestones? If not, explain what you will do to catch up.

What resources will you need to accomplish your next steps? What is unclear in your plan?

● Data scraping and cleaning
Sumukh Sharma
10/18/2019

● Managing missing data
Sumukh Sharma
10/22/2019

  • We have created a document which has all the basic information about how we need to plug in data and manage missing information. It is as follows:

    Formula 1 data mapping and managing:

    1. Circuits: **
      circuitId - Use as key
      circuitRef
      name
      location
      country – Useful to specify by country
      lat - latitude
      lng - longitude
      alt
      url

    2. Constructors: **
      constructorId – use as key
      constructorRef
      name
      nationality
      url

    3. Drivers: **
      driverId – Use as Key
      driverRef
      number – Fill blanks with -1
      code
      forename
      surname
      dob
      nationality
      url

    4. Races: **
      raceId – Use as Key
      year – Season of the race
      round – Race number in that year
      circuitId – Same as Circuitid from circuit
      name
      date
      time – Fill blanks with 10:00
      url

    5. Results: ***
      resultId
      raceId – Use key from Races
      driverId – Use key from Drivers
      constructorId – use key storm constructors
      number – Driver number(5 for vettel, May not be unique). Fill Blanks with -1
      grid – Starting grid position
      position – Finishing position. Fill blanks with -1 for retired cards
      positionText – Position Text. R stands for retired
      positionOrder – Final position order ( This ideally should be the same as position 1…n where n is number of cars that took part in that race)
      points – Points won
      laps – Number of laps driven in a race. Any driver that retired will have a lower number than those that finished for a specific RaceID
      time
      milliseconds
      fastestLap – Gives fastest lap number
      rank – Rank of the fastest lap times
      fastestLapTime – Lap time
      fastestLapSpeed – Avg speed in the fastest lap
      statusId – StatusId of the car at the end of the race. Use key from Status ( Finished. Did not finish. Disqualified. Power unit issues. Etc.)

    6. Seasons: *
      Year - year
      url

    7. Status - **
      statusId – Use this as a key
      status – Status of the car at the end of the race and the reason for retiring.

    8. Constructor results: *
      constructorResultsId – Use as Key
      raceId - Use key from Races
      constructorId – Use key from Constructors
      points – Points scored
      status – Null/ D (D for disqualified)

    9. Constructor Standings:*
      constructorStandingsId
      raceId – Use key from Races
      constructorId – Use key from Constructors
      points – Points scored – Fill Blanks with 0
      position – Position of constructor
      positionText - Text
      wins – Wins in that year

    #### * indicate the importance of the data tables

    #### : foundation data table
    #### : Important data table
    ####
    : Critical data table

● Indexing, selection and filtering
Malav Parekh & Pratik Pandey
11/3/2019

  • Indexing of DataFrames has been done. Selection and filtering are not currently required, so they have been kept for the later stages of analysis. In order to clean the data, some DataFrames have been updated and null vlaues have been managed.

All milestones have been met so far

Since we have already imported the data, we will not need any further resources to proceed with our project implementation. However, in the coming days, as we progress, we intend to use 'Google Colab' for effective collaboration between the members of the project.

Section 3: Data Acquisition and Cleaning Code

Please provide code that demonstrates that you have made progress with data acquisition and cleaning. In a markdown cell at the top of the section, summarize what you have accomplished thus far.

1. We have imported the NumPy and Pandas modules.

2. We have read the .csv files (11 in total) relevant to our project and stored them in DataFrames

3. Three .csv files ('circuits.csv', 'pitStops.csv' and 'drivers.csv') were in TextIOWrapper form. We have converted them into DataFrames and done required cleaning and indexing.

4. We have created copies of these DataFrames and are going to work on these copies so that we can pull out the original tables in case of future use.

5. We have created indices for each of the DataFrames

6. We have managed missing values and eliminated NaN values for the DataFrames, including the most important DataFrame 'results' (its copy).

In [1]:
# Importing required modules
import pandas as pd
import numpy as np

driverStandings.csv

In [2]:
# Reading the 'driverStandings.csv' file and creating a copy of it.
driverStandings = pd.read_csv('driverStandings.csv')
driverStandingsCopy = driverStandings.copy()
#driverStandingsCopy
In [3]:
# Renaming its columns
driverStandingsCopy.rename(columns = {'driverStandingsId':'driverStandingsId', 'raceId':'raceId', 'driverId':'driverId',
                                      'points':'Points', 'position':'Position', 
                                      'positionText':'positionText', 'wins':'Wins'}, inplace = True)

# Setting the index to 'driverStandingsId'
driverStandingsCopy.set_index('driverStandingsId',inplace=True)

driverStandingsCopy
Out[3]:
raceId driverId Points Position positionText Wins
driverStandingsId
1 18 1 10.0 1 1 1
2 18 2 8.0 2 2 0
3 18 3 6.0 3 3 0
4 18 4 5.0 4 4 0
5 18 5 4.0 5 5 0
... ... ... ... ... ... ...
68604 988 836 5.0 18 18 0
68605 988 18 0.0 24 24 0
68606 988 814 0.0 25 25 0
68607 988 842 0.0 21 21 0
68608 988 843 0.0 23 23 0

31726 rows × 6 columns

results.csv

In [4]:
# Reading the 'results.csv' file and creating a copy of it.
results = pd.read_csv('results.csv')
resultsCopy= results.copy()
#resultsCopy
In [5]:
# Renaming its columns
resultsCopy.rename(columns = {'resultId':'resultId', 'raceId':'raceId', 'driverId':'driverId',
                                      'constructorId':'constructorId', 'number':'Number', 'grid': 'Grid',
                              'position':'Position', 'positionText':'positionText', 'positionOrder': 'Position order',
                              'points':'Points', 'laps':'Laps', 'time':'Time','milliseconds':'Time in milliseconds',
                             'fastestLap': 'Fastest Lap', 'rank': 'Rank', 'fastestLapTime': 'Fastest Lap Time',
                             'fastestLapSpeed':'Fastest Lap Speed', 'statusId':'statusId'}, inplace = True)

resultsCopy.set_index('resultId',inplace=True)
In [6]:
#Handling null values for multiple columns in 'results.csv'
values = {'Position': -1, 'Number': -1, 'Time': 0, 'Time in milliseconds': 0}
resultsCopy.fillna(value=values)
Out[6]:
raceId driverId constructorId Number Grid Position positionText Position order Points Laps Time Time in milliseconds Fastest Lap Rank Fastest Lap Time Fastest Lap Speed statusId
resultId
1 18 1 1 22.0 1 1.0 1 1 10.0 58 34:50.6 5690616.0 39.0 2.0 01:27.5 218.3 1
2 18 2 2 3.0 5 2.0 2 2 8.0 58 5.478 5696094.0 41.0 3.0 01:27.7 217.586 1
3 18 3 3 7.0 7 3.0 3 3 6.0 58 8.163 5698779.0 41.0 5.0 01:28.1 216.719 1
4 18 4 4 5.0 11 4.0 4 4 5.0 58 17.181 5707797.0 58.0 7.0 01:28.6 215.464 1
5 18 5 1 23.0 3 5.0 5 5 4.0 58 18.014 5708630.0 43.0 1.0 01:27.4 218.385 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23777 988 842 5 10.0 17 16.0 16 16 0.0 54 0 0.0 33.0 16.0 01:43.8 192.542 11
23778 988 828 15 9.0 19 17.0 17 17 0.0 54 0 0.0 36.0 15.0 01:43.6 193.057 11
23779 988 840 3 18.0 15 18.0 18 18 0.0 54 0 0.0 52.0 6.0 01:42.3 195.402 11
23780 988 832 4 55.0 12 -1.0 R 19 0.0 31 0 0.0 26.0 14.0 01:43.4 193.41 36
23781 988 817 9 3.0 4 -1.0 R 20 0.0 20 0 0.0 13.0 12.0 01:42.8 194.579 9

23777 rows × 17 columns

circuits.csv

In [7]:
# Opening the circuits.csv file
# It is a TextIO Wrapper file
circuitsWrapper = open('circuits.csv', encoding = "ISO-8859-1")
circuits = pd.DataFrame(circuitsWrapper)
In [8]:
# Appending the valuesinto a list
list_circuits=[]
# Since the values are comma seperated and stored in a list, we split them
rowLabels_circuits=circuits[0][0].split(',')
# Looping through the length of the DataFrame
for i in range(1,74):
    values=circuits[0][i].split(',')
    list_circuits.append(values)
# Storing the newly appended list in a DataFrame
circuitsUpdated=pd.DataFrame(list_circuits)  
#circuitsUpdated
In [9]:
# Creating a copy of the above created DataFrame
circuitsCopy= circuitsUpdated.copy()
#circuitsCopy
In [10]:
# Renaming the columns
circuitsCopy.rename(columns = {0:'circuitId', 1:'CircuitRef', 2:'Name', 3:'Location', 
                               4:'country', 5:'Latitude', 6:'Longitude', 7:'Altitude', 8:'URL'}, inplace = True)
#circuitsCopy
In [11]:
# Setting the index as 'circuitId'
circuitsCopy.set_index('circuitId',inplace=True)
In [12]:
# # Dropping redundant columns
circuitsCopy.drop(['Altitude', 9], axis=1, inplace=True)
circuitMap=circuitsCopy
circuitsCopy
Out[12]:
CircuitRef Name Location country Latitude Longitude URL
circuitId
1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.738 http://en.wikipedia.org/wiki/Sepang_Internatio...
3 bahrain Bahrain International Circuit Sakhir Bahrain 26.0325 50.5106 http://en.wikipedia.org/wiki/Bahrain_Internati...
4 catalunya Circuit de Barcelona-Catalunya MontmelÌ_ Spain 41.57 2.26111 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
5 istanbul Istanbul Park Istanbul Turkey 40.9517 29.405 http://en.wikipedia.org/wiki/Istanbul_Park\n
... ... ... ... ... ... ... ...
69 americas Circuit of the Americas Austin USA 30.1328 -97.6411 http://en.wikipedia.org/wiki/Circuit_of_the_Am...
70 red_bull_ring Red Bull Ring Spielburg Austria 47.2197 14.7647 http://en.wikipedia.org/wiki/Red_Bull_Ring\n
71 sochi Sochi Autodrom Sochi Russia 43.4057 39.9578 http://en.wikipedia.org/wiki/Sochi_Autodrom\n
72 port_imperial Port Imperial Street Circuit New Jersey USA 40.7769 -74.0111 http://en.wikipedia.org/wiki/Port_Imperial_Str...
73 BAK Baku City Circuit Baku Azerbaijan 40.3725 49.8533 http://en.wikipedia.org/wiki/Baku_City_Circuit

73 rows × 7 columns

In [13]:
circuitMap
Out[13]:
CircuitRef Name Location country Latitude Longitude URL
circuitId
1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.738 http://en.wikipedia.org/wiki/Sepang_Internatio...
3 bahrain Bahrain International Circuit Sakhir Bahrain 26.0325 50.5106 http://en.wikipedia.org/wiki/Bahrain_Internati...
4 catalunya Circuit de Barcelona-Catalunya MontmelÌ_ Spain 41.57 2.26111 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
5 istanbul Istanbul Park Istanbul Turkey 40.9517 29.405 http://en.wikipedia.org/wiki/Istanbul_Park\n
... ... ... ... ... ... ... ...
69 americas Circuit of the Americas Austin USA 30.1328 -97.6411 http://en.wikipedia.org/wiki/Circuit_of_the_Am...
70 red_bull_ring Red Bull Ring Spielburg Austria 47.2197 14.7647 http://en.wikipedia.org/wiki/Red_Bull_Ring\n
71 sochi Sochi Autodrom Sochi Russia 43.4057 39.9578 http://en.wikipedia.org/wiki/Sochi_Autodrom\n
72 port_imperial Port Imperial Street Circuit New Jersey USA 40.7769 -74.0111 http://en.wikipedia.org/wiki/Port_Imperial_Str...
73 BAK Baku City Circuit Baku Azerbaijan 40.3725 49.8533 http://en.wikipedia.org/wiki/Baku_City_Circuit

73 rows × 7 columns

constructorResults.csv

In [14]:
# Reading the 'constructorResults.csv' file and creating a copy of it.
constructorResults = pd.read_csv('constructorResults.csv')
constructorResultsCopy= constructorResults.copy()
#constructorResultsCopy
In [15]:
# Renaming its columns
constructorResultsCopy.rename(columns = {'constructorResultsId':'constructorResultsId', 'raceId':'raceId',
                                         'constructorId':'constructorId', 'points':'Points', 
                               'status':'Status', 5:'Latitude', 6:'Longitude', 7:'Altitude', 8:'URL'}, inplace = True)

# Setting the index
constructorResultsCopy.set_index('constructorResultsId', inplace=True)
#constructorResultsCopy
In [16]:
#Handling null values
values = {'Status': 'NULL'}
constructorResultsCopy.fillna(value=values)
Out[16]:
raceId constructorId Points Status
constructorResultsId
1 18 1 14.0 NULL
2 18 2 8.0 NULL
3 18 3 9.0 NULL
4 18 4 5.0 NULL
5 18 5 2.0 NULL
... ... ... ... ...
15635 988 1 2.0 NULL
15636 988 3 1.0 NULL
15637 988 210 0.0 NULL
15638 988 15 0.0 NULL
15639 988 5 0.0 NULL

11142 rows × 4 columns

constructors.csv

In [17]:
# Reading the 'constructors.csv' file and creating a copy of it.
constructors = pd.read_csv('constructors.csv')
constructorCopy= constructors.copy()
#constructorCopy
In [18]:
# Renaming the columns
constructorCopy.rename(columns = {'constructorId':'constructorId', 'constructorRef':'ConstructorRef',
                                         'name':'Name', 'nationality':'Nationality', 'url':'URL',
                                  'Unnamed: 5':'Unnamed: 5'}, inplace = True)
# Setting the index
constructorCopy.set_index('constructorId',inplace=True)
In [19]:
# Dropping redundant column
constructorCopy.drop(['Unnamed: 5'], axis=1, inplace=True)
constructorCopy
Out[19]:
ConstructorRef Name Nationality URL
constructorId
1 mclaren McLaren British http://en.wikipedia.org/wiki/McLaren
2 bmw_sauber BMW Sauber German http://en.wikipedia.org/wiki/BMW_Sauber
3 williams Williams British http://en.wikipedia.org/wiki/Williams_Grand_Pr...
4 renault Renault French http://en.wikipedia.org/wiki/Renault_F1
5 toro_rosso Toro Rosso Italian http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso
... ... ... ... ...
206 marussia Marussia Russian http://en.wikipedia.org/wiki/Marussia_F1
207 caterham Caterham Malaysian http://en.wikipedia.org/wiki/Caterham_F1
208 lotus_f1 Lotus F1 British http://en.wikipedia.org/wiki/Lotus_F1
209 manor Manor Marussia British http://en.wikipedia.org/wiki/Manor_Motorsport
210 haas Haas F1 Team American http://en.wikipedia.org/wiki/Haas_F1_Team

208 rows × 4 columns

constructorStandings.csv

In [20]:
# Reading the 'constructorStandings.csv' file and creating a copy of it.
constructorStandings = pd.read_csv('constructorStandings.csv')
constructorStandingsCopy = constructorStandings.copy()
#constructorStandingsCopy
In [21]:
# Renaming the columns
constructorStandingsCopy.rename(columns = {'constructorStandingsId':'constructorStandingsId', 'raceId':'raceId',
                                         'constructorId':'constructorId', 'points':'Points', 'position':'Position',
                                  'positionText':'positionText','wins':'Wins', 'Unnamed: 7':'Unnamed: 7'}, inplace = True)

# Setting the index
constructorStandingsCopy.set_index('constructorStandingsId',inplace=True)
In [22]:
# Dropping redundant column
constructorStandingsCopy.drop(['Unnamed: 7'], axis=1, inplace=True)
constructorStandingsCopy
Out[22]:
raceId constructorId Points Position positionText Wins
constructorStandingsId
1 18 1 14.0 1 1 1
2 18 2 8.0 3 3 0
3 18 3 9.0 2 2 0
4 18 4 5.0 4 4 0
5 18 5 2.0 5 5 0
... ... ... ... ... ... ...
26928 988 5 53.0 7 7 0
26929 988 4 57.0 6 6 0
26930 988 15 5.0 10 10 0
26931 988 1 30.0 9 9 0
26932 988 210 47.0 8 8 0

11896 rows × 6 columns

drivers.csv

In [23]:
# Opeing the 'drivers.csv' TextIOWrapper file
driversWrapper = open('drivers.csv', encoding = "ISO-8859-1")
drivers = pd.DataFrame(driversWrapper)
In [24]:
# Creating a list
list_drivers=[]
# Splitting the comma seperated values
rowLabels_drivers=drivers[0][0].split(',')
# Looping through the length of the DataFrame
for i in range(1,843):
    values=drivers[0][i].split(',')
    # Appending the values to the list
    list_drivers.append(values)
# Creating a new DataFrame by using the list
driversUpdated=pd.DataFrame(list_drivers)  
#driversUpdated
In [25]:
# Creating a copy of the DataFrame
driversCopy= driversUpdated.copy()
#driversCopy
In [26]:
# Renaming the columns
driversCopy.rename(columns = {0:'driverId', 1:'DriverRef', 2:'Number', 3:'Code', 
                               4:'Forename', 5:'Surname', 6:'DOB', 7:'Nationality', 8:'URL'}, inplace = True)
#driversCopy
In [27]:
# Setting the index
driversCopy.set_index('driverId',inplace=True)
In [28]:
# Dropping redundant column
driversCopy.drop([9], axis=1, inplace=True) 
driversCopy
Out[28]:
DriverRef Number Code Forename Surname DOB Nationality URL
driverId
1 hamilton 44 HAM Lewis Hamilton 07/01/1985 British http://en.wikipedia.org/wiki/Lewis_Hamilton\n
2 heidfeld HEI Nick Heidfeld 10/05/1977 German http://en.wikipedia.org/wiki/Nick_Heidfeld\n
3 rosberg 6 ROS Nico Rosberg 27/06/1985 German http://en.wikipedia.org/wiki/Nico_Rosberg\n
4 alonso 14 ALO Fernando Alonso 29/07/1981 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso\n
5 kovalainen KOV Heikki Kovalainen 19/10/1981 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen\n
... ... ... ... ... ... ... ... ...
838 vandoorne 2 VAN Stoffel Vandoorne 26/03/1992 Belgian http://en.wikipedia.org/wiki/Stoffel_Vandoorne\n
839 ocon 31 OCO Esteban Ocon 17/09/1996 French http://en.wikipedia.org/wiki/Esteban_Ocon\n
840 stroll 18 STR Lance Stroll 29/10/1998 Canadian http://en.wikipedia.org/wiki/Lance_Stroll\n
841 giovinazzi 36 GIO Antonio Giovinazzi 14/12/1993 Italian http://en.wikipedia.org/wiki/Antonio_Giovinazzi\n
843 brendon_hartley 39 HAR Brendon Hartley 10/11/1989 New Zealander http://en.wikipedia.org/wiki/Brendon_Hartley

842 rows × 8 columns

In [29]:
# Opening the 'pitStops.csv' file and creating a copy of it after converting it to DataFrame from TextIOWrapper.
pitStopsWrapper = open('pitStops.csv', encoding = "ISO-8859-1")
pitStops = pd.DataFrame(pitStopsWrapper)
#pitStops
In [30]:
# Create a list
list_pitStops=[]
# Splitting the comma seperated values in the wrapper file
rowLabels_pitStops=pitStops[0][0].split(',')
# Looping through the length of the Dataframe
for i in range(1,6252):
    values=pitStops[0][i].split(',')
    # Appending the values to the list
    list_pitStops.append(values)
# Creating a new DataFrame and adding the list to it
pitStopsUpdated=pd.DataFrame(list_pitStops)  
#pitStopsUpdated
In [31]:
# Creating a copy of the DataFrame
pitStopsCopy= pitStopsUpdated.copy()
#pitStopsCopy
In [32]:
# Renaming the columns
pitStopsCopy.rename(columns = {0:'raceId', 1:'driverId', 2:'Stop', 3:'Lap', 
                               4:'Time', 5:'Duration', 6:'Duration in milliseconds'}, inplace = True)
#pitStopsCopy
In [33]:
# Setting the index
pitStopsCopy.set_index('raceId',inplace=True)
In [34]:
pitStopsCopy
Out[34]:
driverId Stop Lap Time Duration Duration in milliseconds
raceId
841 153 1 1 17:05:23 26.898 26898\n
841 30 1 1 17:05:52 25.021 25021\n
841 17 1 11 17:20:48 23.426 23426\n
841 4 1 12 17:22:34 23.251 23251\n
841 13 1 13 17:24:10 23.842 23842\n
... ... ... ... ... ... ...
988 828 1 30 17:56:57 22.176 22176\n
988 839 1 31 17:57:47 21.865 21865\n
988 832 1 31 17:57:49 21.956 21956\n
988 154 1 32 17:59:52 21.85 21850\n
988 840 3 35 18:06:11 21.908 21908\n

6251 rows × 6 columns

races.csv

In [35]:
# Reading the 'races.csv' file and creating a copy of it
races = pd.read_csv('races.csv')
racesCopy = races.copy()
#racesCopy
In [36]:
# Renaming the columns
racesCopy.rename(columns = {'raceId':'raceId', 'year':'Year','round':'Round', 'circuitId':'circuitId', 'name':'Name',
                                  'date':'Date','time':'Time', 'url':'URL'}, inplace = True)
# Setting the index
racesCopy.set_index('raceId',inplace=True)
race2=racesCopy
racesCopy
Out[36]:
Year Round circuitId Name Date Time URL Weather
raceId
1 2009 1 1 Australian Grand Prix 3/29/2009 6:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... Sunny
2 2009 2 2 Malaysian Grand Prix 4/5/2009 9:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... Rainy
3 2009 3 17 Chinese Grand Prix 4/19/2009 7:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... Rainy
4 2009 4 3 Bahrain Grand Prix 4/26/2009 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran... Sunny
5 2009 5 4 Spanish Grand Prix 5/10/2009 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran... Sunny
... ... ... ... ... ... ... ... ...
1005 2018 17 22 Japanese Grand Prix 10/7/2018 5:00:00 http://en.wikipedia.org/wiki/2018_Japanese_Gra... Dry
1006 2018 18 69 United States Grand Prix 10/21/2018 19:00:00 http://en.wikipedia.org/wiki/2018_United_State... Dry
1007 2018 19 32 Mexican Grand Prix 10/28/2018 19:00:00 http://en.wikipedia.org/wiki/2018_Mexican_Gran... Overcast
1008 2018 20 18 Brazilian Grand Prix 11/11/2018 16:00:00 http://en.wikipedia.org/wiki/2018_Brazilian_Gr... Cloudy
1009 2018 21 24 Abu Dhabi Grand Prix 11/25/2018 13:00:00 http://en.wikipedia.org/wiki/2018_Abu_Dhabi_Gr... Sunny

997 rows × 8 columns

seasons.csv

In [37]:
# Reading the 'seasons.csv' file and creating a copy of it
seasons = pd.read_csv('seasons.csv')
seasonsCopy= seasons.copy()
#seasonsCopy
In [38]:
# Renaming the columns
seasonsCopy.rename(columns = {'year':'Year','url':'URL'}, inplace = True)

# Setting the index
seasonsCopy.set_index('Year',inplace=True)
seasonsCopy
Out[38]:
URL
Year
2009 http://en.wikipedia.org/wiki/2009_Formula_One_...
2008 http://en.wikipedia.org/wiki/2008_Formula_One_...
2007 http://en.wikipedia.org/wiki/2007_Formula_One_...
2006 http://en.wikipedia.org/wiki/2006_Formula_One_...
2005 http://en.wikipedia.org/wiki/2005_Formula_One_...
... ...
2014 http://en.wikipedia.org/wiki/2014_Formula_One_...
2015 http://en.wikipedia.org/wiki/2015_Formula_One_...
2016 https://en.wikipedia.org/wiki/2016_Formula_One...
2017 https://en.wikipedia.org/wiki/2017_Formula_One...
2018 http://en.wikipedia.org/wiki/2018_FIA_Formula_...

69 rows × 1 columns

status.csv

In [39]:
# Reading the 'status.csv' file and creating a copy of it
status = pd.read_csv('status.csv')
statusCopy = status.copy()
#statusCopy
In [40]:
# Renaming the columns
statusCopy.rename(columns = {'statusId':'statusId','status':'Status'}, inplace = True)

# Setting the index
statusCopy.set_index('statusId',inplace=True)
statusCopy
Out[40]:
Status
statusId
1 Finished
2 Disqualified
3 Accident
4 Collision
5 Engine
... ...
132 ERS
133 +49 Laps
134 +38 Laps
135 Brake duct
136 Seat

134 rows × 1 columns

Analysis 1: Drivers who won after not being in pole position

In [123]:
print("Number of winners: "+str(len(results[results['position'] == 1.0])))
Number of winners: 979
In [42]:
resultsCopy=resultsCopy[resultsCopy['Position'] == 1.0]
In [43]:
resultsCopy=resultsCopy[resultsCopy['Grid'] != 1]
In [124]:
print("Number of winners who were not in grid position 1: "+str(len(resultsCopy)))
Number of winners who were not in grid position 1: 572

So the number of drivers who won the race even after not starting in pole position was 572 out of 979. The rate of conversion stands pretty high at 58.4% which is massive and does not give us a meaningful analysis. Hence, we will focus on drivers who didn't start in the first 3 posiitons and still went on to win, to gain better insight of how difficult it actually is.

In [45]:
required_drivers = resultsCopy[resultsCopy['Grid'] > 3].sort_values('Grid',ascending=False).drop_duplicates(subset='Grid',keep='first')
required_drivers
Out[45]:
raceId driverId constructorId Number Grid Position positionText Position order Points Laps Time Time in milliseconds Fastest Lap Rank Fastest Lap Time Fastest Lap Speed statusId
resultId
10684 453 187 1 7.0 22 1.0 1 1 9.0 75 53:34.9 6814889.0 NaN NaN NaN NaN 1
19232 800 657 113 14.0 19 1.0 1 1 8.0 200 49:17.3 13757270.0 NaN NaN NaN NaN 1
3150 168 22 6 4.0 18 1.0 1 1 10.0 45 25:34.4 5134418.0 NaN NaN NaN NaN 1
11250 473 187 1 7.0 17 1.0 1 1 9.0 62 58:41.0 7121043.0 NaN NaN NaN NaN 1
4971 250 30 22 1.0 16 1.0 1 1 10.0 44 36:47.9 5807875.0 NaN NaN NaN NaN 1
289 32 4 4 5.0 15 1.0 1 1 10.0 61 57:16.3 7036304.0 55.0 3.0 01:45.8 172.464 1
3590 188 65 24 17.0 14 1.0 1 1 10.0 66 41:54.3 6114314.0 NaN NaN NaN NaN 1
18758 778 611 108 9.0 13 1.0 1 1 8.0 200 41:14.3 13274250.0 NaN NaN NaN NaN 1
17913 741 394 6 50.0 12 1.0 1 1 9.0 52 14:17.5 8057500.0 NaN NaN NaN NaN 1
13251 543 222 59 20.0 11 1.0 1 1 9.0 53 40:11.2 6011190.0 NaN NaN NaN NaN 1
10078 430 182 1 1.0 10 1.0 1 1 9.0 70 32:29.3 5549263.0 NaN NaN NaN NaN 1
13678 558 222 59 20.0 9 1.0 1 1 9.0 80 40:00.0 6000000.0 NaN NaN NaN NaN 1
15934 645 358 32 3.0 8 1.0 1 1 9.0 80 54:37.4 6877400.0 NaN NaN NaN NaN 1
2667 146 14 1 4.0 7 1.0 1 1 10.0 71 27:45.9 5265927.0 NaN NaN NaN NaN 1
22262 906 817 9 3.0 6 1.0 1 1 25.0 70 39:12.8 5952830.0 68.0 4.0 01:18.6 199.638 1
15340 621 304 1 12.0 5 1.0 1 1 9.0 79 45:49.1 6349100.0 NaN NaN NaN NaN 1
6447 304 77 1 2.0 4 1.0 1 1 10.0 81 46:54.8 6414786.0 NaN NaN NaN NaN 1

As we can see, if we filter out drivers who started the races in first three grid positions and went on to win the race, we are only left with 17 entries out 979. This is a massive drop in proportion from 58.4% to 1.7%. This gives us an idea of how difficult it is to win races from grid positions 4 and beyond.

But some drivers have managed the extraordinary.

Let us focus on those who achieved the impossible, by starting in positions 15th and beyond and still went on to win it.

In [46]:
required_drivers = resultsCopy[resultsCopy['Grid'] > 15].sort_values('Grid',ascending=False).drop_duplicates(subset='Grid',keep='first')
required_drivers
Out[46]:
raceId driverId constructorId Number Grid Position positionText Position order Points Laps Time Time in milliseconds Fastest Lap Rank Fastest Lap Time Fastest Lap Speed statusId
resultId
10684 453 187 1 7.0 22 1.0 1 1 9.0 75 53:34.9 6814889.0 NaN NaN NaN NaN 1
19232 800 657 113 14.0 19 1.0 1 1 8.0 200 49:17.3 13757270.0 NaN NaN NaN NaN 1
3150 168 22 6 4.0 18 1.0 1 1 10.0 45 25:34.4 5134418.0 NaN NaN NaN NaN 1
1475 88 8 1 9.0 17 1.0 1 1 10.0 53 29:02.2 5342212.0 44.0 1.0 01:31.5 228.372 1
4971 250 30 22 1.0 16 1.0 1 1 10.0 44 36:47.9 5807875.0 NaN NaN NaN NaN 1
In [47]:
df=pd.DataFrame()
df = [ driversCopy[driversCopy.index == j] for i in required_drivers['driverId'] for j in driversCopy.index if int(j) == i]
#df
In [48]:
drivers_url=[]
for i in range(len(df)):
   
    drivers_url.append(df[i].URL)
In [49]:
from bs4 import BeautifulSoup
import requests
In [50]:
from IPython.display import Image
In [51]:
path = "/Users/prati/FormulaOne/Images/"
Image(filename = path + "Watson.jpg", width=250, height=250)
Out[51]:

John Watson

  • United States Grand Prix, 1983
  • Grid position: 22
In [52]:
Image(filename = path + "Cross.png", width=250, height=250)
Out[52]:

Art Cross

  • Indianapolis 500, 1954
  • Grid Position: 19
In [53]:
Image(filename = path + "Barrichello.png", width=250, height=250)
Out[53]:

Rubens Barrichello

  • German Grand Prix, 2000
  • Grid Position: 18
In [54]:
Image(filename = path + "Raikkonen.png", width=250, height=250)
Out[54]:

Kimi Raikkonen

  • Japanese Grand Prix, 2005
  • Grid Position: 17
In [55]:
Image(filename = path + "Schumi.png", width=250, height=250)
Out[55]:

Michael Schumacher

  • Belgian Grand Prix, 1995
  • Grid Position: 16

Analysis 2: But who is the greatest in history???????

In [56]:
pit_time_list1 = pd.DataFrame(pitStopsCopy.loc['881']['Duration'])
mean_list_for_a_race1 = pit_time_list1['Duration'].astype(float).mean()

mean_list_for_a_race1
Out[56]:
66.45454794520548
In [57]:
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 99)



driverStandingsCopy = driverStandingsCopy.fillna(0)
driverStandingsCopy
Out[57]:
raceId driverId Points Position positionText Wins
driverStandingsId
1 18 1 10.0 1 1 1
2 18 2 8.0 2 2 0
3 18 3 6.0 3 3 0
4 18 4 5.0 4 4 0
5 18 5 4.0 5 5 0
... ... ... ... ... ... ...
68604 988 836 5.0 18 18 0
68605 988 18 0.0 24 24 0
68606 988 814 0.0 25 25 0
68607 988 842 0.0 21 21 0
68608 988 843 0.0 23 23 0

31726 rows × 6 columns

In [58]:
driver_wins = {}
for index, row in driverStandingsCopy.iterrows():
    if row["Position"] == 1:
        if row["driverId"] in driver_wins.keys():
            driver_wins[row["driverId"]] += 1
        else:
            driver_wins[row["driverId"]] = 1
#driver_wins
In [59]:
driverStandingsDf= pd.DataFrame(driverStandingsCopy)

driversDf= pd.DataFrame(driversCopy)
In [60]:
driverStandingsDf=driverStandingsDf.sort_values('driverId')
In [61]:
driversCopy.head()
Out[61]:
DriverRef Number Code Forename Surname DOB Nationality URL
driverId
1 hamilton 44 HAM Lewis Hamilton 07/01/1985 British http://en.wikipedia.org/wiki/Lewis_Hamilton\n
2 heidfeld HEI Nick Heidfeld 10/05/1977 German http://en.wikipedia.org/wiki/Nick_Heidfeld\n
3 rosberg 6 ROS Nico Rosberg 27/06/1985 German http://en.wikipedia.org/wiki/Nico_Rosberg\n
4 alonso 14 ALO Fernando Alonso 29/07/1981 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso\n
5 kovalainen KOV Heikki Kovalainen 19/10/1981 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen\n
In [62]:
driversCopy.reset_index(inplace=True)
In [63]:
print(driversCopy.dtypes, driverStandingsCopy.dtypes)
driverId       object
DriverRef      object
Number         object
Code           object
Forename       object
Surname        object
DOB            object
Nationality    object
URL            object
dtype: object raceId            int64
driverId          int64
Points          float64
Position          int64
positionText     object
Wins              int64
dtype: object
In [64]:
driversCopy['driverId']=driversCopy['driverId'].astype('int')
In [65]:
driverStandings
Out[65]:
driverStandingsId raceId driverId points position positionText wins
0 1 18 1 10.0 1 1 1
1 2 18 2 8.0 2 2 0
2 3 18 3 6.0 3 3 0
3 4 18 4 5.0 4 4 0
4 5 18 5 4.0 5 5 0
... ... ... ... ... ... ... ...
31721 68604 988 836 5.0 18 18 0
31722 68605 988 18 0.0 24 24 0
31723 68606 988 814 0.0 25 25 0
31724 68607 988 842 0.0 21 21 0
31725 68608 988 843 0.0 23 23 0

31726 rows × 7 columns

In [66]:
data=pd.merge(driverStandingsCopy, driversCopy, on='driverId', how='left')
data
Out[66]:
raceId driverId Points Position positionText Wins DriverRef Number Code Forename Surname DOB Nationality URL
0 18 1 10.0 1 1 1 hamilton 44 HAM Lewis Hamilton 07/01/1985 British http://en.wikipedia.org/wiki/Lewis_Hamilton\n
1 18 2 8.0 2 2 0 heidfeld HEI Nick Heidfeld 10/05/1977 German http://en.wikipedia.org/wiki/Nick_Heidfeld\n
2 18 3 6.0 3 3 0 rosberg 6 ROS Nico Rosberg 27/06/1985 German http://en.wikipedia.org/wiki/Nico_Rosberg\n
3 18 4 5.0 4 4 0 alonso 14 ALO Fernando Alonso 29/07/1981 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso\n
4 18 5 4.0 5 5 0 kovalainen KOV Heikki Kovalainen 19/10/1981 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen\n
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31721 988 836 5.0 18 18 0 wehrlein 94 WEH Pascal Wehrlein 18/10/1994 German http://en.wikipedia.org/wiki/Pascal_Wehrlein\n
31722 988 18 0.0 24 24 0 button 22 BUT Jenson Button 19/01/1980 British http://en.wikipedia.org/wiki/Jenson_Button\n
31723 988 814 0.0 25 25 0 resta DIR Paul di Resta 16/04/1986 British http://en.wikipedia.org/wiki/Paul_di_Resta\n
31724 988 842 0.0 21 21 0 gasly 10 GAS Pierre Gasly 07/02/1996 French http://en.wikipedia.org/wiki/Pierre_Gasly\n
31725 988 843 0.0 23 23 0 brendon_hartley 39 HAR Brendon Hartley 10/11/1989 New Zealander http://en.wikipedia.org/wiki/Brendon_Hartley

31726 rows × 14 columns

In [67]:
data.groupby(['DriverRef']).count().sort_values(by='Wins', ascending=False, inplace=True)
data
Out[67]:
raceId driverId Points Position positionText Wins DriverRef Number Code Forename Surname DOB Nationality URL
0 18 1 10.0 1 1 1 hamilton 44 HAM Lewis Hamilton 07/01/1985 British http://en.wikipedia.org/wiki/Lewis_Hamilton\n
1 18 2 8.0 2 2 0 heidfeld HEI Nick Heidfeld 10/05/1977 German http://en.wikipedia.org/wiki/Nick_Heidfeld\n
2 18 3 6.0 3 3 0 rosberg 6 ROS Nico Rosberg 27/06/1985 German http://en.wikipedia.org/wiki/Nico_Rosberg\n
3 18 4 5.0 4 4 0 alonso 14 ALO Fernando Alonso 29/07/1981 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso\n
4 18 5 4.0 5 5 0 kovalainen KOV Heikki Kovalainen 19/10/1981 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen\n
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31721 988 836 5.0 18 18 0 wehrlein 94 WEH Pascal Wehrlein 18/10/1994 German http://en.wikipedia.org/wiki/Pascal_Wehrlein\n
31722 988 18 0.0 24 24 0 button 22 BUT Jenson Button 19/01/1980 British http://en.wikipedia.org/wiki/Jenson_Button\n
31723 988 814 0.0 25 25 0 resta DIR Paul di Resta 16/04/1986 British http://en.wikipedia.org/wiki/Paul_di_Resta\n
31724 988 842 0.0 21 21 0 gasly 10 GAS Pierre Gasly 07/02/1996 French http://en.wikipedia.org/wiki/Pierre_Gasly\n
31725 988 843 0.0 23 23 0 brendon_hartley 39 HAR Brendon Hartley 10/11/1989 New Zealander http://en.wikipedia.org/wiki/Brendon_Hartley

31726 rows × 14 columns

In [68]:
data = data[data["Position"]==1]
data["Surname"].value_counts()
Out[68]:
Schumacher     121
Prost           79
Hamilton        70
Vettel          57
Alonso          52
Hill            48
Senna           48
Lauda           42
Fangio          35
Stewart         33
Rosberg         32
Fittipaldi      28
HÌ_kkinen       27
Mansell         25
Brabham         24
Clark           22
Button          20
Piquet          19
Scheckter       15
Andretti        15
Ascari          14
Reutemann       13
Hulme           12
RÌ_ikk̦nen     12
Jones           11
Rindt            8
Farina           7
Villeneuve       7
Moss             7
Alboreto         6
McLaren          6
Regazzoni        6
Webber           6
Irvine           5
Hawthorn         5
Pironi           4
Laffite          4
Taruffi          3
Arnoux           3
Coulthard        3
Watson           3
de Angelis       2
Trintignant      2
von Trips        2
Collins          2
Massa            2
RodrÌ_guez       1
Fisichella       1
Bandini          1
Hunt             1
Musso            1
Kubica           1
Depailler        1
Behra            1
Surtees          1
Name: Surname, dtype: int64
In [69]:
import matplotlib.pyplot as plt
#import plotly.plotly as py
import plotly.graph_objs as go

import chart_studio
import chart_studio.plotly as py
colors = ['grey'] * 55
colors[0] = 'crimson'

chart_studio.tools.set_credentials_file(username='pratikp', api_key='AHxozWNTaluAU7sx4YW7')

data1=[go.Bar(x=data["Surname"].value_counts().index, y=data["Surname"].value_counts().values, marker_color=colors)]

py.iplot(data1,filename='Wins by Drivers')
Out[69]:

Based on the bar chart plotted above, we can see that Michael Schumacher stands out in terms of number of wins registered in the history of Formula One. He has won a total of 121 races and has 7 championship titles in total. The difference is huge between him and Prost, who comes in at position 2 with 79 wins. All these statistics are enough to usggest that he is the GOAT (Greatest of all time)

In [70]:
unique_raceid_list = list(pitStopsCopy.index.unique())

Analysis 3: Decline in pitstop times

One important aspect in Formula One racing is pit stops. These play a major role in determining who goes on to win the race and depending on how efficiently time is managed during pit stops, wins are decided by milliseconds.

In 2010, tyre rules were changed. Bridgestone withdrew from the racing scene and in June 2011, Pirelli announced that it will be providiing tyres to teams. Refuelling was banned during races. These changes led to drastic decrease in ptistop times as can be seen in the line graph below.

In [71]:
pit_stops_average_df = pd.DataFrame(columns=['Year', 'Average_pit_time'])
for year in range(2011,2018):
#     print(year)


    race_list_for_year = racesCopy[racesCopy['Year']==int(year)].index.unique()

    mean_list_for_a_race = []

    for race in race_list_for_year:
    #     print(race)
        try:
            pit_time_list = pd.DataFrame(pitStopsCopy.loc[str(race)]['Duration'])
            mean_list_for_a_race += [pit_time_list['Duration'].astype(float).mean()]
        except:
            print(race)
            
    mean_list_for_a_race
    mean_dfrow_for_a_specific_year = pd.DataFrame(mean_list_for_a_race)

# float(mean_dfrow_for_a_specific_year.mean())

    pit_stops_average_df = pit_stops_average_df.append({'Year':int(year),'Average_pit_time': float(mean_dfrow_for_a_specific_year.mean())}, ignore_index=True) 


pit_stops_average_df
914
948
960
967
976
Out[71]:
Year Average_pit_time
0 2011.0 38.649453
1 2012.0 23.268415
2 2013.0 26.735497
3 2014.0 25.432787
4 2015.0 25.959985
5 2016.0 25.186868
6 2017.0 24.989076
In [72]:
pit_time_list = pd.DataFrame(pitStopsCopy.loc['969']['Duration'])
# mean = pit_time_list.mean()
meann = pit_time_list['Duration'].astype(float).mean()
#meann
In [73]:
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set_style("whitegrid")
sns.lineplot(x=pit_stops_average_df.Year, y=pit_stops_average_df.Average_pit_time);
plt.ylabel('Time')
plt.title('Year')
# plt.xticks(rotation='vertical');
Out[73]:
Text(0.5, 1.0, 'Year')

As we can see from the line graph plotted above, pitstop times took a huge plunge in 2012 after the introduction of new rules. there is an upward trend from 2012 to 2013 but that is because there were a lot of cases of explosive tyre bursts mid race and that was a major contributor in drivers having to visit pitstops more often. It was due to manufacturing issues in Pirelli tyres. Overall, pitstop times have been gradually declining over the years and are helping constructors achieve clocked times like never before.

Analysis 4: Which constructor has won the most races at a particular circuit?

In [74]:
racesCopy['circuitId']=racesCopy['circuitId'].astype('int')
In [75]:
racesCopy=racesCopy.reset_index()
In [76]:
circuitsCopy=circuitsCopy.reset_index()
In [77]:
circuitsCopy['circuitId']=circuitsCopy['circuitId'].astype('int')
In [78]:
#fig.update_layout(barmode='group', xaxis_tickangle=-45)
#fig.show()
In [79]:
data2=pd.merge(circuitsCopy, racesCopy, on='circuitId', how='left')
In [80]:
data2[['circuitId','Name_x','Location','country','raceId']]
Out[80]:
circuitId Name_x Location country raceId
0 1 Albert Park Grand Prix Circuit Melbourne Australia 1.0
1 1 Albert Park Grand Prix Circuit Melbourne Australia 18.0
2 1 Albert Park Grand Prix Circuit Melbourne Australia 36.0
3 1 Albert Park Grand Prix Circuit Melbourne Australia 55.0
4 1 Albert Park Grand Prix Circuit Melbourne Australia 71.0
... ... ... ... ... ...
993 71 Sochi Autodrom Sochi Russia 1004.0
994 72 Port Imperial Street Circuit New Jersey USA NaN
995 73 Baku City Circuit Baku Azerbaijan 955.0
996 73 Baku City Circuit Baku Azerbaijan 976.0
997 73 Baku City Circuit Baku Azerbaijan 992.0

998 rows × 5 columns

In [81]:
constructorStandingsCopy=constructorStandingsCopy.reset_index()
#constructorStandingsCopy
In [82]:
print(type(constructorStandingsCopy))
<class 'pandas.core.frame.DataFrame'>
In [83]:
data3=pd.merge(data2, constructorStandingsCopy, on='raceId', how='left')
In [84]:
pd.set_option("display.max_rows", 12000)
data3.head()
Out[84]:
circuitId CircuitRef Name_x Location country Latitude Longitude URL_x raceId Year Round Name_y Date Time URL_y Weather constructorStandingsId constructorId Points Position positionText Wins
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 1.0 2009.0 1.0 Australian Grand Prix 3/29/2009 6:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... Sunny 4046.0 6.0 0.0 9.0 9 0.0
1 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 1.0 2009.0 1.0 Australian Grand Prix 3/29/2009 6:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... Sunny 4044.0 2.0 0.0 7.0 7 0.0
2 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 1.0 2009.0 1.0 Australian Grand Prix 3/29/2009 6:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... Sunny 4045.0 9.0 0.0 8.0 8 0.0
3 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 1.0 2009.0 1.0 Australian Grand Prix 3/29/2009 6:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... Sunny 4043.0 10.0 0.0 6.0 6 0.0
4 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 1.0 2009.0 1.0 Australian Grand Prix 3/29/2009 6:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... Sunny 4042.0 5.0 3.0 5.0 5 0.0
In [125]:
data3[data3['Position']==1.0].head()
Out[125]:
circuitId CircuitRef Name_x Location country Latitude Longitude URL_x raceId Year Round Name_y Date Time URL_y Weather constructorStandingsId constructorId Points Position positionText Wins
8 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 1.0 2009.0 1.0 Australian Grand Prix 3/29/2009 6:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... Sunny 4038.0 23.0 18.0 1.0 1 1.0
9 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 18.0 2008.0 1.0 Australian Grand Prix 3/16/2008 4:30:00 http://en.wikipedia.org/wiki/2008_Australian_G... NaN 1.0 1.0 14.0 1.0 1 1.0
24 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 36.0 2007.0 1.0 Australian Grand Prix 3/18/2007 3:00:00 http://en.wikipedia.org/wiki/2007_Australian_G... NaN 25733.0 6.0 13.0 1.0 1 1.0
36 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 55.0 2006.0 3.0 Australian Grand Prix 4/2/2006 14:00:00 http://en.wikipedia.org/wiki/2006_Australian_G... NaN 6339.0 4.0 42.0 1.0 1 3.0
37 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 71.0 2005.0 1.0 Australian Grand Prix 3/6/2005 14:00:00 http://en.wikipedia.org/wiki/2005_Australian_G... NaN 567.0 4.0 16.0 1.0 1 1.0
In [86]:
data8=data3.groupby(['circuitId','constructorId']).count()
In [87]:
#data8
In [88]:
data7=data3.groupby(['circuitId','constructorId']).count().sort_values(by='Wins',ascending=False)
In [89]:
#data7=data7.reset_index()
In [90]:
data9=data7
In [91]:
#data9
In [92]:
data10=data9.reset_index()
In [93]:
data11=pd.DataFrame(data10.groupby(by='circuitId')['Wins'].max())
In [94]:
#data10.groupby(['circuitId','constructorId']).sort_values(by='Wins',ascending=True).drop_duplicates(keep= 'last')
#data11
In [95]:
#data11=data10[['circuitId','constructorId','Wins']]
#data10
data12=pd.merge(data10, data11, on=['circuitId','Wins'], how='inner')
In [96]:
data13=data12[['circuitId','constructorId','Wins']]
In [126]:
data13.sort_values(by=['circuitId','constructorId'],ascending=True).head()
Out[126]:
circuitId constructorId Wins
24 1 6.0 22
27 2 1.0 19
28 2 3.0 19
26 2 6.0 19
44 3 1.0 13
In [99]:
#data13['Wins'] = data[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
In [127]:
data14=pd.merge(data13, circuitsCopy, on='circuitId', how='left')
data14[['circuitId','constructorId','Wins','Latitude','Longitude']].head()
Out[127]:
circuitId constructorId Wins Latitude Longitude
0 6 6.0 60 43.7347 7.42056
1 14 6.0 59 45.6156 9.28111
2 9 6.0 45 52.0786 -1.01694
3 13 6.0 43 50.4372 5.97139
4 7 6.0 38 45.5 -73.5228
In [101]:
import plotly.express as px

fig1 = go.Figure(go.Densitymapbox(lat=data14.Latitude, lon=data14.Longitude, z=data14.Wins, radius=25))
fig1.update_layout(mapbox_style="open-street-map", mapbox_center_lon=180)

fig1.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig1.show()

In the above map, we have plotted the circuits and the number of maximum wins achieved by any constructor on that particular circuit.

Analysis 5: The most challenging circuits

Nelson Piquet once said,racing at Monaco—by far the most famous street circuit of them all—was "like trying to cycle around your living room."

Most races end up with some drivers retiring and not being able to finish the race. Primary reasons behind this are engine failures and difficult circuits.

The most challenging circuits are street circuits as they have a lot of barriers and these are the ones that face driver retirements more often than other circuits.

In [102]:
res=results
In [103]:
race=racesCopy
In [104]:
values = {'position': -1, 'number': -1, 'time': 0, 'milliseconds': 0}
res=res.fillna(value=values)
In [105]:
data4=pd.merge(res, race, on='raceId', how='left')
In [106]:
data4=data4[data4['position'] == -1.0]
In [107]:
data5=data4.groupby(['circuitId']).count().sort_values(by='raceId',ascending=False)
In [108]:
data5=data5.reset_index()
In [128]:
data5[['circuitId','number']].head()
Out[128]:
circuitId number
0 6 843
1 14 796
2 9 527
3 13 464
4 7 444
In [110]:
data6=pd.merge(data5, circuitsCopy, on='circuitId', how='left')
In [129]:
data6[['circuitId','number','Latitude','Longitude']].head()
Out[129]:
circuitId number Latitude Longitude
0 6 843 43.7347 7.42056
1 14 796 45.6156 9.28111
2 9 527 52.0786 -1.01694
3 13 464 50.4372 5.97139
4 7 444 45.5 -73.5228
In [112]:
#import plotly.graph_objects as go
fig = go.Figure(go.Densitymapbox(lat=data6.Latitude, lon=data6.Longitude, z=data6.number, radius=35))
fig.update_layout(mapbox_style="stamen-terrain", mapbox_center_lon=180)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

According to Bleacher Report, the 'Circuit De Monaco' is the most challenging circuit in the world and as seen in the visualization above, we can see that this circuit has had the most driver retirements standing at 843 followed by 'Autodromo Nazionale di Monza' in second.

Although the track in Monza is not a street circuit, it has high curve speeds which has resulted in a lot of fatal crashes and other accidents. Only drivers with proper techniques and superior aerodynamic efficiency are able to challenge for top spots here.

Another track which is very challenging is the 'Circuit Gilles Villeneuve' in Montreal, Canada and is rightly fourth on the list of most retired drivers.

Analysis 6: Formula 1- A global sport

In [115]:
circuitMap['Raceloc'] = 0
circuit_id_list = list(race2[race2['Year'] == 2018]['circuitId'])

circuit_id_list
Out[115]:
[1, 3, 17, 73, 4, 6, 7, 34, 70, 9, 10, 11, 13, 14, 15, 71, 22, 69, 32, 18, 24]
In [117]:
circuit_id_list_notlive_i = list(race2[race2['Year'] != 2018]['circuitId'])
circuit_id_list_notlive = [x for x in circuit_id_list_notlive_i if x not in circuit_id_list]
# len(circuit_id_list_notlive)
# len(circuit_id_list)

# circuitsCopy[~circuitsCopy['circuitId'].isin(circuit_id_list)]['Raceloc'] = 1
circuit_id_list 
# circuit_id_list

for rowId in circuit_id_list:
     circuitMap.loc[str(rowId),'Raceloc'] = 1
In [130]:
circuitMap.head()
Out[130]:
CircuitRef Name Location country Latitude Longitude URL Raceloc
circuitId
1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.8497 144.968 http://en.wikipedia.org/wiki/Melbourne_Grand_P... 1
2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.738 http://en.wikipedia.org/wiki/Sepang_Internatio... 0
3 bahrain Bahrain International Circuit Sakhir Bahrain 26.0325 50.5106 http://en.wikipedia.org/wiki/Bahrain_Internati... 1
4 catalunya Circuit de Barcelona-Catalunya MontmelÌ_ Spain 41.57 2.26111 http://en.wikipedia.org/wiki/Circuit_de_Barcel... 1
5 istanbul Istanbul Park Istanbul Turkey 40.9517 29.405 http://en.wikipedia.org/wiki/Istanbul_Park\n 0
In [121]:
import seaborn as sns
import plotly.graph_objects as go
mapbox_access_token = 'pk.eyJ1Ijoic3VtdWtoc2hhcm1hIiwiYSI6ImNrMm92bjJydzE3c3UzcHQ1Nzdjb3JwNHMifQ.x_ybvMGOCORXoXDYqnJCCg'
class_map_data1 = go.Scattermapbox(
        lon = circuitMap[circuitMap['Raceloc']!=1]['Longitude'],
        lat = circuitMap[circuitMap['Raceloc']!=1]['Latitude'],
        text = circuitMap[circuitMap['Raceloc']!=1]['Name'],
        hoverinfo='text',
        mode = 'markers',
        marker = dict(
                    color = 'red',
                    symbol = 'circle',
                    opacity = .5
                ),
        name = "Not live circuits"
)
class_map_data2 = go.Scattermapbox(
        lon = circuitMap[circuitMap['Raceloc']==1]['Longitude'],
        lat = circuitMap[circuitMap['Raceloc']==1]['Latitude'],
        text = circuitMap[circuitMap['Raceloc']==1]['Name'],
        hoverinfo='text',
        mode = 'markers',
        marker = dict(
                    color = 'green',
                    symbol = 'circle',
                    opacity = 1
                ),
        name = "Live circuits"
)

class_map_layout = go.Layout(
        title = 'Circuit Locations',
        mapbox=go.layout.Mapbox(
            accesstoken=mapbox_access_token,
            zoom=1
        )
    )

class_map = go.Figure(data=[class_map_data1,class_map_data2], layout=class_map_layout)
class_map.show()

As can be seen in the map above, we have some circuits that are live and some that aren't. This just shows how globally spread out the sport is.